home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Skunkware 5
/
Skunkware 5.iso
/
src
/
X11
/
wais
/
ir
/
DESIGN
< prev
next >
Wrap
Text File
|
1995-05-09
|
4KB
|
96 lines
Serial Indexer Overview
Brewster Kahle
The serial indexer is a simple inverted file system that is not very
different from existing IR systems.
DATABASE FILES
The serial indexer parses files and creates an inverted file index made up
of 7 files. For a database named "index" the files would be:
* index.inv -- the "postings file" that is a term followed by a list of
entries each of which describe where that word occurs in the original files.
A posting is a weight, doc_number (see the doc file), and character_position.
This file is indexed with the dictionary (dct) file. The terms are in
alphabetical order.
* index.doc -- this is a linear list of document-entries one for each
document. A document can be a complete file or a piece of a file (such as
mail files that are the concatentation of many messages, each message is a
document). The information kept in each entry is:
filename_id: position into the filename file (fn) of the filename
for this document.
headline_id: position in the headline file.
start_character: position in the file where this document starts
end_character: end position. 0 if complete file.
document_length: in characters
number_of_lines
date: time_t
* index.fn -- list of the filenames in the database with the write-date
of the file and the type of the file. Type is a string. Indexed by the
position in the file, so this file can not be edited after the index is built.
* index.hl -- list of the headlines. Indexed by position.
* index.dct -- dictionary file which is a 2 level b-tree. The first block
is pointers to the every 1000th entry in the rest of the dictionary file.
Each entry is a fixed-length record of the word with the position into the
rest of the file. The rest of the file are blocks just like the first
block, but each entry is the word plus the position of it in the inverted
file (inv). The whole dictionary is in alphabetical order.
* index.src -- source description that is used to access the database.
It is also returned as a response to the "help" query. This file is not
overwritten once it is created. Therefore database maintainers should edit
this file to add a good desciption of what that database contains.
* index.status -- only the ram based seeker uses this to describe itself
and get parameters from the user.
INDEXING
A new index is built by parsing the input files, finding the words in it
and creating the filename, headline, and doc tables at parse time. The
words are then passed to routines defined in irext.h which define the
frontend/backend boundary.
The serial indexer creates intermediate inv files starting at inv0 then
inv1 etc. These are created by accumulating the words in a memory
hashtable. Each invN file is in alphabetical order, so they can be merged
easily into one inv file. This merging is done logarithmically, so that it
is fast. Unfortunately, it means that just before the merging is done, the
index occupies twice the space that it will finally.
On the final merge, a dictionary file is produced by dumping the positions
of the start of terms in the final inverted file.
Adding to an exising database is done by adding to the filename, headline,
and doc tables as the parsing is done, but waiting to merge the new invN
files into the old inv file until all the new merging is done.
Theoretically it should be able to be done while always having the database
usable.
SEARCHING
A query is parsed for its seed words and relevant documents and is passed
to a function search_word (defined in irext.h). The backend does whatever
it will do to search and set some state. Then best_hit is called
iteratively to find the documents that most closely matched the query.
These results are used to look up headlines and filenames to hand back to
the user.
The serial backend does this by loading the postings for a particular term
and then adding it into a score array. The score array has an entry for
each document. A document gets its score increased by containing terms in
the query.
Roughly the goals of this design were:
Make a flexible platform for experimentation
Portable
Low search overhead for fast startup and lightweight access
Usable in pieces for multiple search engines backends
Fast searching
Easy to implement
Size of the resulting database was not a priority